ABSTRACT


People today face many health problems, and the incidence of disease is rising as the population grows. This survey examines how
data mining techniques can predict thyroid disorders at an early stage. Classification techniques play an important role in identifying disease in medical data. The
main objective of this paper is to classify records as thyroid or non-thyroid and to improve classification accuracy. We propose a robust ensemble model that
combines random forest, Naive Bayes and K-Nearest Neighbors (K-NN) classifiers. The proposed model achieves a classification accuracy of
93.55%. We also apply a feature optimization technique, Optimize Selection, to eliminate irrelevant features from the data set and reduce the
computational cost of the model. With a reduced subset of 3 features, the proposed model achieves an accuracy of 97.61%.
I. INTRODUCTION

In medical science, diagnosing a health condition is a very challenging task.
People face a variety of health problems, among which thyroid disease is a
critical one. Thyroid disease classification is an important problem in medical
science because it is directly related to the health of the human body; this
type of disease can be managed with proper and careful treatment. A modern
medical diagnosis system is a decision-based system that identifies the problem
by classifying data. The main purpose of this work is to study thyroid disease
with the help of data mining techniques. Data mining plays a vital role in the
medical field for the diagnosis of disease. It offers many classification
techniques for predicting disease accurately.


Several authors have worked on the classification of thyroid disease. S.
Gaikwad et al. [1] suggested random forest for the classification of thyroid
data; the suggested model gives an accuracy of 96.63%. A. Upadhyay et al. [2]
used two decision tree classifiers, C4.5 and C5.0, for the classification of
thyroid disease; the C5.0 model gives an accuracy of 95%, which is better than
the C4.5 classifier. N. Sigh [3] suggested that the Support Vector Machine
(SVM) is a better classifier than K-NN and Bayesian Net; SVM gives an accuracy
of 84.62%. M. C. Frates [4] suggested different image classifiers, namely
Artificial Neural Networks (ANN), Support Vector Machines (SVM), fuzzy
measures, Genetic Algorithms (GA) and Fuzzy Support Vector Machines (FSVM), for
the classification of thyroid disease. The textural features in ANN help to
resolve misclassification, and SVM is among the best available machine learning
algorithms for classifying high-dimensional data sets. D. Kerana Hanirex et al.
[7] suggested the NNge model for the classification of thyroid disease; the
NNge classifier gives an accuracy of 96.44% with a reduced number of features.
Lavanya, D., et al. [8] suggested the CART classifier and compared it with
other decision tree classifiers, C4.5 and ID3, for the classification of
thyroid data; CART achieved the highest accuracy, 94.68%. S. Panday et al. [12]
used various classifiers, including C4.5, Random Forest, Multilayer Perceptron
and Bayes Net, for the classification of thyroid data; the C4.5 classifier
gives better classification accuracy than the others. K. Geeta et al. (2016)
[13] proposed an Evolutionary Multivariate Bayesian prediction classifier for
the classification of thyroid disease. K. Rajam [14] discussed the use of data
mining techniques for the classification of thyroid disease, specifically
exploring Naive Bayes, decision tree, back propagation and support vector
machine in the context of thyroid disease.


In this paper, data mining techniques such as Random Forest, Naive Bayes
and K-NN are used to develop a classifier for the diagnosis and classification
of thyroid disease. A data set downloaded from the UCI repository is used for
the experiments; the entire work is carried out with RapidMiner Studio software
in a Windows 7 environment.


II. DATA SET DESCRIPTION 

The thyroid data set is taken from the UCI Machine Learning Repository [5]. The
data set was supplied by the Garvan Institute, and its documentation was
provided by Ross Quinlan. The database consists of patient records: 7547
instances, each with 29 features and 1 class attribute (thyroid or non-thyroid).
Features are Boolean or continuous valued. The features are Age, Sex, On
thyroxine, Query on thyroxine, On antithyroid medication, Sick, Pregnant,
Thyroid surgery, I131 treatment, Query hypothyroid, Query hyperthyroid,
Lithium, Goitre, Tumor, Hypopituitary, Psych, TSH measured, TSH, T3 measured,
T3, TT4 measured, TT4, T4U measured, T4U, FTI measured, FTI, TBG measured, TBG
and Referral source.


III. CLASSIFICATION TECHNIQUES

> Decision tree [6] is a very popular data mining technique. A decision tree is a
structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a
test, and each leaf node holds a class label. The topmost node in the tree is
the root node. In this research work we have used Random Forest for the
classification of thyroid data.


Random Forest (or RF) [9] is an ensemble classifier that consists of many 
decision trees and outputs the class that is the mode of the classes output by 
individual trees. Random Forests are often used when we have very large 
training datasets and a very large number of input variables (hundreds or 
even thousands of input variables). A random forest model is typically made 
up of tens or hundreds of decision trees. 


> Bayesian classification [6] is based on Bayes' Theorem. Bayesian classifiers
are statistical classifiers. They can predict class membership
probabilities, such as the probability that a given tuple belongs to a
particular class. Studies comparing classification algorithms have found a
simple Bayesian classifier known as the naive Bayes classifier to be comparable
in performance with decision tree and selected neural network classifiers.
Bayesian classifiers have also exhibited high accuracy and speed when applied
to large data sets.


> The k-nearest-neighbor method [6] was first described in the early 1950s.
The method is labor intensive when given large training sets, but it has since
been widely used in the area of pattern recognition. Nearest-neighbor
classifiers are based on learning by analogy, that is, by comparing a given test
tuple with training tuples that are similar to it.
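
As an illustration of learning by analogy, the following minimal Python sketch classifies a test tuple by majority vote among its k nearest training tuples. It is not the RapidMiner operator used in this work: the function name knn_predict, the Euclidean distance, the value k=3 and the toy records are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training tuples."""
    # Euclidean distance from the query to every training tuple.
    distances = [
        (math.dist(x, query), label)
        for x, label in zip(train_X, train_y)
    ]
    # Keep the k closest tuples and vote on their class labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy records with two made-up continuous features (not taken from the thyroid data set).
train_X = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
train_y = ["non-thyroid"] * 3 + ["thyroid"] * 3

print(knn_predict(train_X, train_y, (1.1, 1.0)))
print(knn_predict(train_X, train_y, (4.9, 5.1)))
```

Because every prediction scans the whole training set, the cost grows with the number of training tuples, which is why the method is labor intensive on large training sets.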


IV. FEATURE OPTIMIZATION AND ENSEMBLE MODEL 

> Feature selection [10] is an optimization process in which one tries to find 
the best feature subset from the fixed set of the original features, according 
to a given processing goal and feature selection criteria. The solution to an
optimal feature selection problem need not be unique: different subsets of the
original features may achieve the same goal with the same performance
measure. An optimal feature set will depend on the data, the processing
goal, and the selection criteria being used. 


In this research work, the Optimize Selection technique is used to optimize
the original feature set. This approach uses two deterministic greedy feature
selection algorithms, forward selection and backward elimination [15].


> An ensemble model [11] combines the outputs of several classifiers produced
by weak learners into a single composite classification. It can be used to
reduce the error of any weak learning algorithm. The purpose of combining
these classifiers is to build a hybrid model that improves classification
accuracy compared to each individual classifier.
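
One common way to combine the outputs of several classifiers is majority voting. The paper does not state which combination rule the proposed ensemble uses, so the Python sketch below is only an assumed illustration; the function name majority_vote and the three prediction lists are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per record by majority vote."""
    combined = []
    # zip(*...) groups the labels that the models assigned to the same record.
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions from three base classifiers on five records.
rf_pred = ["thyroid", "non-thyroid", "thyroid", "non-thyroid", "thyroid"]
nb_pred = ["thyroid", "thyroid", "non-thyroid", "non-thyroid", "thyroid"]
knn_pred = ["non-thyroid", "non-thyroid", "thyroid", "non-thyroid", "thyroid"]

ensemble_pred = majority_vote([rf_pred, nb_pred, knn_pred])
print(ensemble_pred)
```

With three base classifiers and two classes a tie cannot occur, so each record receives the label chosen by at least two of the models.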


V. RESULTS AND DISCUSSION 

This research work was carried out with the RapidMiner data mining tool in a
Windows environment. We used classification techniques such as Random Forest,
Naive Bayes and K-NN for the classification of thyroid disease. We applied the
thyroid data set to these classification techniques with a 70-30%
training-testing partition. The individual models do not give satisfactory
results. We therefore propose a new ensemble model, a combination of Random
Forest, Naive Bayes and K-NN, which gives a better classification accuracy of
93.55% than the individual models. Table 1 shows the accuracy of the individual
models and the proposed ensemble model. Fig. 1 shows a graphical representation
of the confusion matrix of the proposed ensemble model. The confusion matrix
can be used to calculate the performance of a model. Table 2 shows performance
measures of the proposed ensemble model, such as sensitivity, specificity and
accuracy, which are used to check the robustness of the model.
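
The performance measures named above follow directly from the four cells of a binary confusion matrix. The Python sketch below shows the standard definitions; the function name confusion_metrics and the counts are made up for illustration and are not the values behind Table 2.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Return sensitivity, specificity and accuracy from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        # Share of actual thyroid cases that were predicted as thyroid.
        "sensitivity": tp / (tp + fn),
        # Share of actual non-thyroid cases that were predicted as non-thyroid.
        "specificity": tn / (tn + fp),
        # Share of all cases that were predicted correctly.
        "accuracy": (tp + tn) / total,
    }


# Made-up counts, with thyroid treated as the positive class.
metrics = confusion_metrics(tp=90, fn=10, fp=20, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```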


Feature optimization selects the most relevant features from the original
feature space. In this research work we used Optimize Selection to optimize the
feature subset, reducing the computation time and increasing the accuracy of
the model. Table 3 shows the accuracy of the proposed ensemble model with the
feature optimization technique. The proposed ensemble model achieved a better
accuracy of 97.61% with 3 features. Overall, the proposed ensemble model
classifies thyroid and non-thyroid cases with high accuracy and less
computation time.
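
Greedy forward selection, one of the two algorithms named in Section IV, can be sketched in Python as follows. This is not RapidMiner's implementation: the function forward_selection, the scoring function toy_score and its weights are hypothetical stand-ins for a real model-accuracy estimate.

```python
def forward_selection(features, score):
    """Greedily add the feature that most improves `score` until none helps."""
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    while remaining:
        # Score every subset obtained by adding one more feature.
        candidates = [(score(selected + [f]), f) for f in remaining]
        cand_score, cand_feature = max(candidates, key=lambda c: c[0])
        if cand_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(cand_feature)
        remaining.remove(cand_feature)
        best_score = cand_score
    return selected, best_score


# Hypothetical scoring function standing in for validated accuracy:
# it rewards two features with made-up weights and penalises subset size.
USEFUL = {"TSH": 0.5, "T3": 0.3}


def toy_score(subset):
    return sum(USEFUL.get(f, 0.0) for f in subset) - 0.01 * len(subset)


features = ["Age", "Sex", "TSH", "T3", "Goitre"]
selected, best = forward_selection(features, toy_score)
print(selected, round(best, 2))
```

Backward elimination works the other way round: it starts from the full feature set and repeatedly removes the feature whose removal most improves the score.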


Table 1: Accuracy of models with 70-30% training-testing data partition

Model Name                            Accuracy (%)
Random Forest + Naive Bayes + K-NN    93.55
Fig 1: Graphical representation of confusion matrix of proposed model


Table 2: Performance measures of proposed ensemble model 


Specificity 91.78% 


Table 3: Feature optimization technique on proposed ensemble model

Feature selection technique                           Number of features    Name of features        Accuracy (%) after feature selection
Optimize Selection (forward & backward elimination)   3                     TSH, T3 measured, T3    97.61





VI. CONCLUSION 

Identifying disease is a critical problem in medical science. Classification
is one of the important techniques for labelling data as thyroid or
non-thyroid. Considerable research has been done in the field of thyroid
classification, and different data mining techniques have been used to build
robust classifiers. In this paper, a new ensemble model is developed for the
classification of thyroid disease with high accuracy. We also applied a feature
optimization technique to improve the computational performance of the model.
The proposed model gives a satisfactory result of 97.61% accuracy with only a
few features.